Hello PixieDust!

This sample notebook introduces many of the features included in PixieDust. You can find more information about PixieDust at https://pixiedust.github.io/pixiedust/. To make sure you are running the latest version of PixieDust, uncomment and run the following cell. Do not run this cell if you installed PixieDust locally from source and want to keep running it from source.


In [ ]:
#!pip install --user --upgrade pixiedust
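Before upgrading, it can help to see which version, if any, is already installed. The following is a minimal sketch rather than a PixieDust feature: it assumes Python 3.8 or later (for the standard library's importlib.metadata), and installed_version is a hypothetical helper name. It reports only the installed version and does not check whether a newer release exists.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# installed_version is a hypothetical helper, not part of PixieDust.
version = installed_version("pixiedust")
if version is None:
    print("pixiedust is not installed")
else:
    print("pixiedust " + version)
```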

Import PixieDust

Run the following cell to import the PixieDust library. You may need to restart your kernel after importing. Follow the instructions, if any, after running the cell. Note: You must import PixieDust every time you restart your kernel.


In [ ]:
import pixiedust

Enable the Spark Progress Monitor

PixieDust includes a Spark Progress Monitor bar that lets you track the status of your Spark jobs. You can find more info at https://pixiedust.github.io/pixiedust/sparkmonitor.html. Run the following cell to enable the Spark Progress Monitor:


In [ ]:
pixiedust.enableJobMonitor();

Example use of the PackageManager

You can use the PackageManager component of PixieDust to install and uninstall Maven packages in your notebook kernel without editing configuration files. This component is essential when you run notebooks in a hosted cloud environment and do not have access to the configuration files. You can find more info at https://pixiedust.github.io/pixiedust/packagemanager.html. Run the following cell to install the GraphFrames package. You may need to restart your kernel after installing new packages. Follow the instructions, if any, after running the cell.


In [ ]:
pixiedust.installPackage("graphframes:graphframes:0.1.0-spark1.6")
print("done")
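The string passed to installPackage above looks like a Maven coordinate of the form groupId:artifactId:version. The sketch below only illustrates that format in plain Python. MavenCoordinate and parse_coordinate are hypothetical names introduced here; they are not part of PixieDust and do not reflect how PixieDust parses the string internally.

```python
from collections import namedtuple

# Hypothetical container for the three parts of a Maven coordinate.
MavenCoordinate = namedtuple(
    "MavenCoordinate", ["group_id", "artifact_id", "version"]
)


def parse_coordinate(coordinate):
    """Split 'groupId:artifactId:version' into its three non-empty parts."""
    parts = coordinate.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            "expected groupId:artifactId:version, got %r" % coordinate
        )
    return MavenCoordinate(*parts)


coord = parse_coordinate("graphframes:graphframes:0.1.0-spark1.6")
print(coord.group_id, coord.artifact_id, coord.version)
```

Checking the shape of the string first can catch a missing colon or an empty field before the package manager tries to resolve it.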

Run the following cell to print out all installed packages:


In [ ]:
pixiedust.printAllPackages()

Example use of the display() API

PixieDust lets you visualize your data in just a few clicks using the display() API. You can find more info at https://pixiedust.github.io/pixiedust/displayapi.html. The following cell creates a DataFrame and uses the display() API to create a bar chart:


In [ ]:
from pyspark.sql import SQLContext

# sc is the SparkContext provided by the notebook kernel
sqlContext = SQLContext(sc)
d1 = sqlContext.createDataFrame(
[(2010, 'Camping Equipment', 3),
 (2010, 'Golf Equipment', 1),
 (2010, 'Mountaineering Equipment', 1),
 (2010, 'Outdoor Protection', 2),
 (2010, 'Personal Accessories', 2),
 (2011, 'Camping Equipment', 4),
 (2011, 'Golf Equipment', 5),
 (2011, 'Mountaineering Equipment', 2),
 (2011, 'Outdoor Protection', 4),
 (2011, 'Personal Accessories', 2),
 (2012, 'Camping Equipment', 5),
 (2012, 'Golf Equipment', 5),
 (2012, 'Mountaineering Equipment', 3),
 (2012, 'Outdoor Protection', 5),
 (2012, 'Personal Accessories', 3),
 (2013, 'Camping Equipment', 8),
 (2013, 'Golf Equipment', 5),
 (2013, 'Mountaineering Equipment', 3),
 (2013, 'Outdoor Protection', 8),
 (2013, 'Personal Accessories', 4)],
["year","zone","unique_customers"])

display(d1)
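A bar chart of this data needs a key and an aggregated value. As a rough illustration of one aggregation you might pick in the chart options (the sum of unique_customers per year), the sketch below computes those totals in plain Python, without Spark or PixieDust. It is an assumption about a useful chart configuration, not a description of what display() does internally.

```python
from collections import defaultdict

# Same rows as the DataFrame above: (year, zone, unique_customers).
rows = [
    (2010, 'Camping Equipment', 3), (2010, 'Golf Equipment', 1),
    (2010, 'Mountaineering Equipment', 1), (2010, 'Outdoor Protection', 2),
    (2010, 'Personal Accessories', 2),
    (2011, 'Camping Equipment', 4), (2011, 'Golf Equipment', 5),
    (2011, 'Mountaineering Equipment', 2), (2011, 'Outdoor Protection', 4),
    (2011, 'Personal Accessories', 2),
    (2012, 'Camping Equipment', 5), (2012, 'Golf Equipment', 5),
    (2012, 'Mountaineering Equipment', 3), (2012, 'Outdoor Protection', 5),
    (2012, 'Personal Accessories', 3),
    (2013, 'Camping Equipment', 8), (2013, 'Golf Equipment', 5),
    (2013, 'Mountaineering Equipment', 3), (2013, 'Outdoor Protection', 8),
    (2013, 'Personal Accessories', 4),
]

# Sum unique_customers for each year (the bar chart key).
totals_by_year = defaultdict(int)
for year, _zone, unique_customers in rows:
    totals_by_year[year] += unique_customers

# Print a crude text bar chart, one row per year.
for year in sorted(totals_by_year):
    print(year, "#" * totals_by_year[year], totals_by_year[year])
```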

Example use of the Scala bridge

Data scientists working with Spark may occasionally need to call one of the hundreds of libraries available on spark-packages.org, many of which are written in Scala or Java. PixieDust addresses this by letting you write and run Scala code directly in its own cell. It also lets you share variables between Python and Scala in both directions. You can find more info at https://pixiedust.github.io/pixiedust/scalabridge.html.

Start by creating Python variables that we'll use in Scala:


In [ ]:
python_var = "Hello From Python"
python_num = 10

Write Scala code that uses python_var and python_num, and creates a new variable that we'll use in Python:


In [ ]:
%%scala
println(python_var)
println(python_num+10)
val __scala_var = "Hello From Scala"

Use __scala_var from Python:


In [ ]:
print(__scala_var)

Sample Data

PixieDust includes a number of sample data sets. You can use these sample data sets to start playing with the display() API and other PixieDust features. You can find more info at https://pixiedust.github.io/pixiedust/loaddata.html. Run the following cell to view the available data sets:


In [ ]:
pixiedust.sampleData()

Example use of sample data

To use the sample data locally, run the following cell to install the required packages. You may need to restart your kernel after running this cell.


In [ ]:
pixiedust.installPackage("com.databricks:spark-csv_2.10:1.5.0")
pixiedust.installPackage("org.apache.commons:commons-csv:0")

Run the following cell to get the first data set from the list. This will return a DataFrame and assign it to the variable d2:


In [ ]:
d2 = pixiedust.sampleData(1)

Pass the sample data set (d2) into the display() API:


In [ ]:
display(d2)

You can also load data from a CSV file at a URL into a DataFrame, which you can then use with the display() API:


In [ ]:
d3 = pixiedust.sampleData("https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv")

PixieDust Log

PixieDust comes with logging to help you troubleshoot issues. You can find more info at https://pixiedust.github.io/pixiedust/logging.html. To access the log, run the following cell:


In [ ]:
%pixiedustLog -l debug

Environment Info.

The following cells will print out information related to your notebook environment.


In [ ]:
%%scala
val __scala_version = util.Properties.versionNumberString

In [ ]:
import platform
print('PYTHON VERSION = ' + platform.python_version())
print('SPARK VERSION = ' + sc.version)
print('SCALA VERSION = ' + __scala_version)
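The cell above raises a NameError if sc or __scala_version is not defined, for example when the %%scala cell has not been run yet. The following is a minimal sketch of a more forgiving variant. environment_info is a hypothetical helper introduced here, not part of PixieDust, and it assumes that the object stored under sc exposes a version attribute, as the cell above does.

```python
import platform


def environment_info(namespace):
    """Collect version strings, using 'unavailable' for anything missing."""
    info = {"python": platform.python_version()}
    # Look the names up in the given namespace instead of referencing them
    # directly, so a missing variable does not raise NameError.
    spark_context = namespace.get("sc")
    if spark_context is not None:
        info["spark"] = spark_context.version
    else:
        info["spark"] = "unavailable"
    info["scala"] = namespace.get("__scala_version", "unavailable")
    return info


# In a notebook, globals() holds sc and __scala_version when they exist.
for name, version in environment_info(globals()).items():
    print(name.upper() + " VERSION = " + str(version))
```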

More Info.

For more information about PixieDust, check out the following:

PixieDust Documentation: https://pixiedust.github.io/pixiedust/index.html

PixieDust GitHub Repo: https://github.com/ibm-watson-data-lab/pixiedust